**Cache Replacement Policy Using Frequency Information**

Yang Wang (321004893)

*{wyang1989}@tamu.edu*

**ABSTRACT**

*The widely used LRU replacement policy has the following problems. First, LRU does not use frequency information of cache access. This makes LRU easy to be polluted by infrequently used blocks, which evict frequently used blocks from LRU stack. Second, LRU may work poorly when the there is a lot of cycle pattern cache access and cache capacity is less than the working set.*

*However efficient last-level cache utilization is crucial to avoid long latency cache misses to main memory. Always predicting a near-immediate re-reference interval on all cache insertions limits cache performance on the other hand always predicting a distant re-reference interval significantly decreases cache performance for access pattern that predominately have a near-immediate re-reference interval.*

*For this matter, a cache replacement based on Re-reference Interval Prediction (RRIP) may be used to solve the problems. This cache replacement policy is called Re-Reference Interval Prediction replacement policy. And in this project, I apply this idea to L2 cache, which is the last level cache of my project. The performance and sensitivities of this replacement policy is analyzed.*

**1. INTRODUCTION**

An ideal replacement policy should make its replacement decision using the pattern of each block and replace the block that will be re-referenced furthest in the future. However there is no perfect replacement policy. Current replacement policies make decision on certain kind of assumption of which block will be re-referenced furthest in the future. Generally when misses happened replacement policies predicts the re-referenced possibility of this block in future.

The commonly used LRU (Least Recently Used) replacement policy record the pattern of recent referenced blocks. When a block needs to be replaced, LRU will evict the block at the tail of LRU chain and put new block at the head of that chain. The block at the head of the LRU chain is predicted to be near-immediate re-reference, which means it is assumed to be re-referenced again in soonest and the block at the tail implies that block might be re-reference in the distant future, which means it might be re-referenced again in longest future. So when a new block is inserted into LRU chain, LRU predicts it will be re-referenced in immediate future and put it at the head of the chain. LRU works well on workloads with high locality, but it cannot make right prediction when re-references mostly occurs in distant future. Benchmark usually performs poorly under LRU when its reference occurs in the distant future. The limit of LRU is cause by following two aspects.

First, LRU only records the present of blocks without frequency information. LRU assumes that the most recently used blocks will have the highest probability to be used again in the near future. This makes it is easy to be polluted by some infrequently used blocks. Some blocks might be used only few times, but they will replace the blocks with distant references, which might be used a lot. Considering the following access pattern: A, B, B, A, C, D, E, F, A, B will be put in same set with 4 blocks. Even though 'A' and 'B' appears 2 times, but they will be evicted by less referenced block 'E' and 'F'.

Second, LRU may suffer from cache thrashing for cyclic access patterns when the working set is greater than the available cache size because this kind of pattern will traverse LRU chain. For this matter, this project explores a different kind of replacement policy, which record the frequency information of each block as well. When a block needs to be replaced, this policy will choose the block occurs least up to now and replace it. This replacement policy is based on Re-reference Interval Prediction (RRIP).

RRIP prevents cache blocks with distant re-reference interval from evicting blocks that have near-immediate re-reference interval. This is implemented by using M-bit information per block to store its Re-reference Prediction Value (RRPV). This project use RRIP method to make comparison between RRIP based replacement policy and LRU policy and also explore the sensitivity of RRIP to different parameters.

**2. MOTIVIATION**

Reducing the miss rate of last-level cache is important since it can avoid long latency cache misses to main memory. However commonly used LRU replacement policy does not perform well on workloads with many distant reference blocks. To better understand when LRU performs poorly, we should understand the feature of different kind of pattern.

(a) LRU-friendly Access Pattern

(b) Thrashing Access Pattern

(c) Infinite Access Pattern

**Figure 1: Cache Access Pattern**

Fig1(a) shows a LRU friendly access pattern. This access has a near-immediate re-reference interval. For any value of k, the access pattern can be well predicted by LRU policy. Fig1(b) shows a thrashing access pattern. It is a cyclic access and when N is large enough LRU will make 0 cache hits due to cache thrashing. For example like the following access way: "A, B, C, D, E, F, A, B, C, D, E, F, A, B, C, D, E, F" There is a cyclic access "A, B, C, D, E, F". If cache set has 3 blocks, then the access will like this:

|  |  |  |  |  |
| --- | --- | --- | --- | --- |
| Next Ref | Cache Set | | |  |
| A |  |  |  | Miss |
| B | A |  |  | Miss |
| C | B | A |  | Miss |
| D | C | B | A | Miss |
| E | D | C | B | Miss |
| F | E | D | C | Miss |
| A | F | E | D | Miss |
| B | A | F | E | Miss |
| ... | ... | ... | ... | Miss... |

Fig1(c) presents a streaming access pattern, when k is unlimited large, the access pattern has no locality. This access pattern will receive no cache hits.

So we can regards any workload as a mix access pattern of above 3 kind of access way. Then if the 2nd access pattern is dominant, the LRU will perform poorly. But if the LRU friendly access pattern is dominating, LRU should work well. However, RRIP will works well for both 1st and 2nd access patter, so there is improvement space for LRU.

**3. Re-Reference Interval Prediction (RRIP)**

Both recent and frequency information are important to cache replacement, while LRU only used the recent information. When benchmark is a mix access pattern, LRU cannot distinguish between blocks that have a distant re-reference interval from blocks that have a near-immediate re-reference interval.

Always predicting a near-immediate re-reference interval on all cache insertions limits the cache performance. Since blocks will occupy the cache space without receiving any hits. On the other hand, always predicting a distant-immediate re-reference interval also significantly decreases the cache performance.

RRIP uses M-bits per block information to record possibility Re-reference Prediction Values (RRPV). RRIP will dynamically learn re-reference information for each block in the cache access pattern. When RRPV=0, it means this block will be re-reference in the near-immediate future. While when RRPV equals means this block will be re-reference in the furthest future. So all the value between 0 to indicated the different re-reference possibility from near-immediate to furthest re-reference. For instance, with 2 bits information (RRPV range from 0 to 3), 0 predicts the cache block will be reference in the soonest future and 3 predicts the cache block will be reference in longest future.

Then higher M value means more bits to represent the pattern of each block. And it can discriminate more different patterns. However, the performance of this replacement policy does not always go up with higher M value. I will explore the sensitivity to M value in later.

On initial all blocks is set to highest RRPV value. When a new block is inserted into cache, RRIP reduce the value of this block by 1. And when this block is re-referenced, RRPV is set directly to 0 or reduced by 1 again. These two different choices will be analyzed in later. In this way, RRIP learns the block’s re-reference pattern. When a block needs to be replaced, RRIP always chose the first block with highest prediction value. If there is no highest value, then increasing all the value of each block by 1 and repeat till one block reaches the highest value.

The following is an example for the decision procedure with 2-bit RRPV information. So when the value is 0, it means the block will be re-referenced in near-immediate and value 3 represents that block might be re-referenced in long time later.

The decision process (With 2 bit RRPV information):

(1) Cache Hit:

Set RRPV of block to ‘0’

(2) Cache Miss:

1. Search for first block with ‘3’ from left

2. If ‘3’ found go to step 5.

3. Increase all RRPV by 1.

4. Go to step 1.

5. Replace block and decrease RRPV by 1.

One thing should be mentation is how we set the RRPV when there is a hit. Generally there are two different ways. One is set RRPV to 0 once there is a hit; the other way is reducing RRPV by 1 till 0 on each hit.

Another thing is how we set the RRPV when there is a hit. There are 3 choices. One is set it to 0 for the new block brought into, which means we predict this block will be re-referenced in nearest future. Or set it to , which means we predict this block will be re-referenced in longest future. Or set it to , which means we predicts this block might be re-referenced in long future not nearest nor longest. Also I will explore this sensitivity in later chapter.

**4. Methodology**

**4.1 Simulator**

I use SimpleScalar for this project to do the simulation. I use 2 level memory hierarchy. Level 1 cache is a 4-way 32KB with 64B block size. L1 cache sizes are kept constant in this study.

**4.2 Benchmarks**

I use 2 float point benchmark equake and art and 2 integer benchmark gcc and gzip.

**4.3 Project Content**

The following chapter will explore different static RRIP sensitivity. First we will explore the sensitivity to RRPV on insertion and decide which strategy we should use to set the RRPV when we bring in new block. Second we will decide which strategy we should use to set the RRPV when there is a hit. Third exploring the sensitivity to M value. Forth exploring the sensitivity to L2 cache configuration. And finally probe the sensitivity to associativity.

**4.4 Implementation**

This replacement policy is implemented by revising the cache.c, cache.h and simoutorder.c file. The main idea is to add re-reference prediction value (RRPV) to each block, that is the cache\_blk\_t structure in cache.h. Then it is initialized to at first. And in cache.c, the code will handle the miss and hit cases.

**5. Results and Analysis**

**5.1 RRIP sensitivity to RRPV on insertion**

Re-reference soonest Re-reference longer Re-reference longest

0

**Figure 2: Re-Reference Prediction value**

In this section, I first investigate the performance of this replacement policy by changing the re-reference prediction value, which is the value when we first insert a block into cache set. As mentioned above, there are 3 different choices. Set 0 (Predicting immediate), Set (Predicting distant re-reference) and Set (Predicting long re-reference). We can see from Fig 2, if this value is set to 0 that means predicting this newly inserted block will be re-referenced immediately, which is what LRU does. If re-reference prediction value is set to that means predicting this newly inserted block will not be re-referenced in near future. And if this value is set to that means predicting the re-reference possibility of this newly inserted is neither the soonest nor the longest.

The configuration for L2 cache will be 16-way associativity and 64B block size for the rest if this project without explicitly mention. The cache size changes among 1MB, 2MB and 4MB. And M=2, that means there are 4 bits to record frequency information. Value 0 stands for immediate re-reference, 2 means long re-reference and 3 means distant (longest) re-reference.

**Figure 3: Sensitivity to re-reference prediction value.**

As we can see from Fig 3, always predicting long re-reference, that is, setting RRPV to (in this experiment it is 2) constantly yield the best performance. And always predicting the distant re-reference gives the worst performance, that is, setting PPPV to when we insert new block into cache. This feature can be explained by the replacing strategy. RRIP will replace a block with value (in current experiment it is 3) when a block must be replaced. So if we set the new block value to when we insert it to cache. There will not be enough time to reduce the RRPV. And it might be replaced immediately by the RRIP strategy. And predicting the block to be long re-referenced will give RRIP enough time to learn the block pattern and do not pollute cache.

So in this project we will always use long RRIP insertion configuration for rest analysis.

**5.2 Sensitivity to Hit Priority And Frequency Priority**

|  |  |  |
| --- | --- | --- |
| Next Reference | RRPV of certain block | Action |
| a1 | 2 | Miss |
| a1 | 0 | Hit |
| a1 | 0 | Hit |
| ... | ... | ... |

Hit Priority: (Using M=2, so there are 4 different RRPV value. A newly inserted block RRPV is always set to 2 . When hit happen, RRPV is set to 0 directly)

|  |  |  |
| --- | --- | --- |
| Next Reference | RRPV of certain block | Action |
| a1 | 2 | Miss |
| a1 | 1 | Hit |
| a1 | 0 | Hit |
| ... | ... | ... |

Frequency Priority: (Using M=2, so there are 4 different RRPV value. A newly inserted block RRPV is always set to 2 When hit happen, RRPV is reduced by one each time.)

In last section, what choice should be made when new block is inserted into the cache set is decided. So in this section, the action of setting RRPV is discussed when a hit happened. There are 2 different way to handle the hit situation. As shown in above table, one is to set re-referenced predicting value directly to 0 when there is a hit called hit priority. Another is to reduce that value by one till zero when there is a hit called frequency priority. So the frequency priority will record more information about the frequency pattern of each block than the hit priority. In summary hit priority predicts a hit block will be re-referenced in near-immediate re-reference and frequency priority predicts not all hit block have a near-immediate re-reference but only the frequently hit block have near-immediate re-reference.

**Figure 4: Sensitivity to hit priority and frequent priority**

Fig 4 shows the results of RRIP performance on HP and FP under different L2 cache size. Both float point and integer benchmark works better on frequency priority. The reason for this kind of pattern shows the benefits of this algorithm is from recording the frequently referenced block. Allowing more time to learn whether a block is really near-immediate re-reference is important. So for the following project, the FP will always be used for simulation.

**5.3 Sensitivity to M value**

In each block we use value to record the reference frequency, which predicts the re-reference possibility of each block. Value 0 means the highest possibility of immediate re-reference and value means the lowest so that block with value will always be chosen to replace by RRIP because it is predicted not to be reference again in near future. For this matter, with M value increase, there will be more bits to record the frequency information, making a frequently referenced block less likely to be polluted.

**Figure 5: M-value sensitivity study**

Above figure indicates that this replacement policy does not perform better with higher M value. That means even though there is more bits to record the frequency information, it does not help reducing miss rate. And in this project, it shows that when M=3, the RRIP performance best. When the cache size is more than 2MB, miss rate does not change with different M value. This feature might caused by the compulsory miss rate, which limits the performance of RRIP. Then M value should be decided for different cache configuration and different benchmark for best performance. So for the rest of this project, I will choose M=3 for RRIP configuration.

**5.4 Sensitivity to Cache Configuration**

Given the results and anaysis of above three section, I have probe the sensitivity of this replacement policy to its own configuration. Now it is time to analyze the performance by varying cache size.

In this section, I explore the sensitivity of RRIP on different cache size: 256K, 512K, 1MB, 2MB. All the cahce is 16-way associativity with 64B block size. And the performance imporvement is compared to LRU. Two integer benchmark is used. And one float point benchmark is used. The configuration for RRIP is 3 bits RRPV information (M=3) and frequency priority with long re-reference predicting on each insertion.

**Figure 6: Performance analysis by varying L2 cache size.**

The above graph shows the performance difference between LRU and RRIP among different L2 cache size. First we can see that RRIP replacement policy outperform LRU in some cases. And for other cases, it is at least as well as LRU. Also it shows that for integer benchmark RRIP performance better for small cache size and for float point benchmark RRIP performance better for larger cache size. So the performance of RRIP is also dependent on different benchmark. If all the data in that benchmark is LRU friendly, then the advantage of RRIP cannot be revealed.

In this study, RRIP works constantly better on all benchmark and work best on benchmark art with 1MB L2 cache. This can be explained by there might be a lot scan-pattern on benchmark art. Also this study shows the advantage of RRIP might not be seen for small cache size.

**5.5 Sensitivity to Associativity**

At last section, I use following parameters for study the sensitivity of RRIP to associativity. L2 cache is 1MB with 64B block size and the associativity is: 4way, 8way, 16way and 32way.

**Figure 7: Performance analysis under different associativity.**

As it indicates in the figure, RRIP outperform LRU about 30-40%. The best performance happens on 4-way associativity for 1MB L2 cache. For 2-way associativity, the advantage of RRIP cannot be fully revealed since there are only 2 blocks in each set. So in any way one of the two blocks will be replaced when a new block must be inserted into current set. However the reason for high associativity also does not yield the best performance has nothing to do with RRIP.

**6. SUMMARY**

Cache replacement policies try to replace the block with least possibility of immediate re-reference. The commonly used LRU replacement policy always predicts a newly inserted block with immediate re-reference, that means always predicting a block with the highest possibility of re-reference soonest in the future. However, in fact it is not always true. Some frequently accessed block may be eliminated by block with distant re-reference interval. So in this project, I conduct a experiment by adding frequency information into each block. When a block should be replaced, this re-reference interval prediction replacement policy will always choose the block with lowest possibility of sooner re-reference again in the future. In this way, the frequently reference blocks are prevent from polluting by less frequently reference blocks. The sensitivity of this replacement policy to different configuration is explored. The performance is also compared to LRU and shows this policy can outperform in different configuration.

**7. ACKNOWLEDGEMENTS**

The core idea and the decision procedure of this Re-Reference Interval Prediction replacement policy is from Aamer Jaleel's paper.

**9. REFERENCES**

[1] Aamer Jaleel. High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP). *ISCA’10*, June 19-23, 2010.

**Appendix A: Simulation Code**

cache.h

/\* cache.h - cache module interfaces \*/

#ifndef CACHE\_H

#define CACHE\_H

#include <stdio.h>

#include "host.h"

#include "misc.h"

#include "machine.h"

#include "memory.h"

#include "stats.h"

/\* highly associative caches are implemented using a hash table lookup to

speed block access, this macro decides if a cache is "highly associative" \*/

#define CACHE\_HIGHLY\_ASSOC(cp) ((cp)->assoc > 4)

/\* cache replacement policy \*/

enum cache\_policy {

LRU, /\* replace least recently used block (perfect LRU) \*/

Random, /\* replace a random block \*/

FIFO, /\* replace the oldest block in the set \*/

/\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*/

SRRIP /\*Re-Reference Interval Prediction\*/

/\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*/

};

/\* block status values \*/

#define CACHE\_BLK\_VALID 0x00000001 /\* block in valid, in use \*/

#define CACHE\_BLK\_DIRTY 0x00000002 /\* dirty block \*/

/\* cache block (or line) definition \*/

struct cache\_blk\_t

{

//\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*

// Add RRPV to cache block

//\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*

int rrpv;

//\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*

struct cache\_blk\_t \*way\_next; /\* next block in the ordered way chain, used

to order blocks for replacement \*/

struct cache\_blk\_t \*way\_prev; /\* previous block in the order way chain \*/

struct cache\_blk\_t \*hash\_next;/\* next block in the hash bucket chain, only

used in highly-associative caches \*/

/\* since hash table lists are typically small, there is no previous

pointer, deletion requires a trip through the hash table bucket list \*/

md\_addr\_t tag; /\* data block tag value \*/

unsigned int status; /\* block status, see CACHE\_BLK\_\* defs above \*/

tick\_t ready; /\* time when block will be accessible, field

is set when a miss fetch is initiated \*/

byte\_t \*user\_data; /\* pointer to user defined data, e.g.,

pre-decode data or physical page address \*/

/\* DATA should be pointer-aligned due to preceeding field \*/

/\* NOTE: this is a variable-size tail array, this must be the LAST field

defined in this structure! \*/

byte\_t data[1]; /\* actual data block starts here, block size

should probably be a multiple of 8 \*/

};

/\* cache set definition (one or more blocks sharing the same set index) \*/

struct cache\_set\_t

{

struct cache\_blk\_t \*\*hash; /\* hash table: for fast access w/assoc, NULL

for low-assoc caches \*/

struct cache\_blk\_t \*way\_head; /\* head of way list \*/

struct cache\_blk\_t \*way\_tail; /\* tail pf way list \*/

struct cache\_blk\_t \*blks; /\* cache blocks, allocated sequentially, so

this pointer can also be used for random

access to cache blocks \*/

};

/\* cache definition \*/

struct cache\_t

{

/\* parameters \*/

char \*name; /\* cache name \*/

int nsets; /\* number of sets \*/

int bsize; /\* block size in bytes \*/

int balloc; /\* maintain cache contents? \*/

int usize; /\* user allocated data size \*/

int assoc; /\* cache associativity \*/

enum cache\_policy policy; /\* cache replacement policy \*/

unsigned int hit\_latency; /\* cache hit latency \*/

/\* miss/replacement handler, read/write BSIZE bytes starting at BADDR

from/into cache block BLK, returns the latency of the operation

if initiated at NOW, returned latencies indicate how long it takes

for the cache access to continue (e.g., fill a write buffer), the

miss/repl functions are required to track how this operation will

effect the latency of later operations (e.g., write buffer fills),

if !BALLOC, then just return the latency; BLK\_ACCESS\_FN is also

responsible for generating any user data and incorporating the latency

of that operation \*/

unsigned int /\* latency of block access \*/

(\*blk\_access\_fn)(enum mem\_cmd cmd, /\* block access command \*/

md\_addr\_t baddr, /\* program address to access \*/

int bsize, /\* size of the cache block \*/

struct cache\_blk\_t \*blk, /\* ptr to cache block struct \*/

tick\_t now); /\* when fetch was initiated \*/

/\* derived data, for fast decoding \*/

int hsize; /\* cache set hash table size \*/

md\_addr\_t blk\_mask;

int set\_shift;

md\_addr\_t set\_mask; /\* use \*after\* shift \*/

int tag\_shift;

md\_addr\_t tag\_mask; /\* use \*after\* shift \*/

md\_addr\_t tagset\_mask; /\* used for fast hit detection \*/

/\* bus resource \*/

tick\_t bus\_free; /\* time when bus to next level of cache is

free, NOTE: the bus model assumes only a

single, fully-pipelined port to the next

level of memory that requires the bus only

one cycle for cache line transfer (the

latency of the access to the lower level

may be more than one cycle, as specified

by the miss handler \*/

/\* per-cache stats \*/

counter\_t hits; /\* total number of hits \*/

counter\_t misses; /\* total number of misses \*/

counter\_t replacements; /\* total number of replacements at misses \*/

counter\_t writebacks; /\* total number of writebacks at misses \*/

counter\_t invalidations; /\* total number of external invalidations \*/

/\* last block to hit, used to optimize cache hit processing \*/

md\_addr\_t last\_tagset; /\* tag of last line accessed \*/

struct cache\_blk\_t \*last\_blk; /\* cache block last accessed \*/

/\* data blocks \*/

byte\_t \*data; /\* pointer to data blocks allocation \*/

/\* NOTE: this is a variable-size tail array, this must be the LAST field

defined in this structure! \*/

struct cache\_set\_t sets[1]; /\* each entry is a set \*/

};

/\* create and initialize a general cache structure \*/

struct cache\_t \* /\* pointer to cache created \*/

cache\_create(char \*name, /\* name of the cache \*/

int nsets, /\* total number of sets in cache \*/

int bsize, /\* block (line) size of cache \*/

int balloc, /\* allocate data space for blocks? \*/

int usize, /\* size of user data to alloc w/blks \*/

int assoc, /\* associativity of cache \*/

enum cache\_policy policy, /\* replacement policy w/in sets \*/

/\* block access function, see description w/in struct cache def \*/

unsigned int (\*blk\_access\_fn)(enum mem\_cmd cmd,

md\_addr\_t baddr, int bsize,

struct cache\_blk\_t \*blk,

tick\_t now),

unsigned int hit\_latency);/\* latency in cycles for a hit \*/

/\* parse policy \*/

enum cache\_policy /\* replacement policy enum \*/

cache\_char2policy(char c); /\* replacement policy as a char \*/

/\* print cache configuration \*/

void

cache\_config(struct cache\_t \*cp, /\* cache instance \*/

FILE \*stream); /\* output stream \*/

/\* register cache stats \*/

void

cache\_reg\_stats(struct cache\_t \*cp, /\* cache instance \*/

struct stat\_sdb\_t \*sdb);/\* stats database \*/

/\* print cache stats \*/

void

cache\_stats(struct cache\_t \*cp, /\* cache instance \*/

FILE \*stream); /\* output stream \*/

/\* print cache stats \*/

void cache\_stats(struct cache\_t \*cp, FILE \*stream);

/\* access a cache, perform a CMD operation on cache CP at address ADDR,

places NBYTES of data at \*P, returns latency of operation if initiated

at NOW, places pointer to block user data in \*UDATA, \*P is untouched if

cache blocks are not allocated (!CP->BALLOC), UDATA should be NULL if no

user data is attached to blocks \*/

unsigned int /\* latency of access in cycles \*/

cache\_access(struct cache\_t \*cp, /\* cache to access \*/

enum mem\_cmd cmd, /\* access type, Read or Write \*/

md\_addr\_t addr, /\* address of access \*/

void \*vp, /\* ptr to buffer for input/output \*/

int nbytes, /\* number of bytes to access \*/

tick\_t now, /\* time of access \*/

byte\_t \*\*udata, /\* for return of user data ptr \*/

md\_addr\_t \*repl\_addr); /\* for address of replaced block \*/

/\* cache access functions, these are safe, they check alignment and

permissions \*/

#define cache\_double(cp, cmd, addr, p, now, udata) \

cache\_access(cp, cmd, addr, p, sizeof(double), now, udata)

#define cache\_float(cp, cmd, addr, p, now, udata) \

cache\_access(cp, cmd, addr, p, sizeof(float), now, udata)

#define cache\_dword(cp, cmd, addr, p, now, udata) \

cache\_access(cp, cmd, addr, p, sizeof(long long), now, udata)

#define cache\_word(cp, cmd, addr, p, now, udata) \

cache\_access(cp, cmd, addr, p, sizeof(int), now, udata)

#define cache\_half(cp, cmd, addr, p, now, udata) \

cache\_access(cp, cmd, addr, p, sizeof(short), now, udata)

#define cache\_byte(cp, cmd, addr, p, now, udata) \

cache\_access(cp, cmd, addr, p, sizeof(char), now, udata)

/\* return non-zero if block containing address ADDR is contained in cache

CP, this interface is used primarily for debugging and asserting cache

invariants \*/

int /\* non-zero if access would hit \*/

cache\_probe(struct cache\_t \*cp, /\* cache instance to probe \*/

md\_addr\_t addr); /\* address of block to probe \*/

/\* flush the entire cache, returns latency of the operation \*/

unsigned int /\* latency of the flush operation \*/

cache\_flush(struct cache\_t \*cp, /\* cache instance to flush \*/

tick\_t now); /\* time of cache flush \*/

/\* flush the block containing ADDR from the cache CP, returns the latency of

the block flush operation \*/

unsigned int /\* latency of flush operation \*/

cache\_flush\_addr(struct cache\_t \*cp, /\* cache instance to flush \*/

md\_addr\_t addr, /\* address of block to flush \*/

tick\_t now); /\* time of cache flush \*/

#endif /\* CACHE\_H \*/

**cache.c**

#include <stdio.h>

#include <stdlib.h>

#include <assert.h>

#include "host.h"

#include "misc.h"

#include "machine.h"

#include "cache.h"

/\* cache access macros \*/

#define CACHE\_TAG(cp, addr) ((addr) >> (cp)->tag\_shift)

#define CACHE\_SET(cp, addr) (((addr) >> (cp)->set\_shift) & (cp)->set\_mask)

#define CACHE\_BLK(cp, addr) ((addr) & (cp)->blk\_mask)

#define CACHE\_TAGSET(cp, addr) ((addr) & (cp)->tagset\_mask)

/\* extract/reconstruct a block address \*/

#define CACHE\_BADDR(cp, addr) ((addr) & ~(cp)->blk\_mask)

#define CACHE\_MK\_BADDR(cp, tag, set) \

(((tag) << (cp)->tag\_shift)|((set) << (cp)->set\_shift))

/\* index an array of cache blocks, non-trivial due to variable length blocks \*/

#define CACHE\_BINDEX(cp, blks, i) \

((struct cache\_blk\_t \*)(((char \*)(blks)) + \

(i)\*(sizeof(struct cache\_blk\_t) + \

((cp)->balloc \

? (cp)->bsize\*sizeof(byte\_t) : 0))))

/\* cache data block accessor, type parameterized \*/

#define \_\_CACHE\_ACCESS(type, data, bofs) \

(\*((type \*)(((char \*)data) + (bofs))))

/\* cache data block accessors, by type \*/

#define CACHE\_DOUBLE(data, bofs) \_\_CACHE\_ACCESS(double, data, bofs)

#define CACHE\_FLOAT(data, bofs) \_\_CACHE\_ACCESS(float, data, bofs)

#define CACHE\_WORD(data, bofs) \_\_CACHE\_ACCESS(unsigned int, data, bofs)

#define CACHE\_HALF(data, bofs) \_\_CACHE\_ACCESS(unsigned short, data, bofs)

#define CACHE\_BYTE(data, bofs) \_\_CACHE\_ACCESS(unsigned char, data, bofs)

/\* cache block hashing macros, this macro is used to index into a cache

set hash table (to find the correct block on N in an N-way cache), the

cache set index function is CACHE\_SET, defined above \*/

#define CACHE\_HASH(cp, key) \

(((key >> 24) ^ (key >> 16) ^ (key >> 8) ^ key) & ((cp)->hsize-1))

/\* copy data out of a cache block to buffer indicated by argument pointer p \*/

#define CACHE\_BCOPY(cmd, blk, bofs, p, nbytes) \

if (cmd == Read) \

{ \

switch (nbytes) { \

case 1: \

\*((byte\_t \*)p) = CACHE\_BYTE(&blk->data[0], bofs); break; \

case 2: \

\*((half\_t \*)p) = CACHE\_HALF(&blk->data[0], bofs); break; \

case 4: \

\*((word\_t \*)p) = CACHE\_WORD(&blk->data[0], bofs); break; \

default: \

{ /\* >= 8, power of two, fits in block \*/ \

int words = nbytes >> 2; \

while (words-- > 0) \

{ \

\*((word\_t \*)p) = CACHE\_WORD(&blk->data[0], bofs); \

p += 4; bofs += 4; \

}\

}\

}\

}\

else /\* cmd == Write \*/ \

{ \

switch (nbytes) { \

case 1: \

CACHE\_BYTE(&blk->data[0], bofs) = \*((byte\_t \*)p); break; \

case 2: \

CACHE\_HALF(&blk->data[0], bofs) = \*((half\_t \*)p); break; \

case 4: \

CACHE\_WORD(&blk->data[0], bofs) = \*((word\_t \*)p); break; \

default: \

{ /\* >= 8, power of two, fits in block \*/ \

int words = nbytes >> 2; \

while (words-- > 0) \

{ \

CACHE\_WORD(&blk->data[0], bofs) = \*((word\_t \*)p); \

p += 4; bofs += 4; \

}\

}\

}\

}

/\* bound sqword\_t/dfloat\_t to positive int \*/

#define BOUND\_POS(N) ((int)(MIN(MAX(0, (N)), 2147483647)))

/\* unlink BLK from the hash table bucket chain in SET \*/

static void

unlink\_htab\_ent(struct cache\_t \*cp, /\* cache to update \*/

struct cache\_set\_t \*set, /\* set containing bkt chain \*/

struct cache\_blk\_t \*blk) /\* block to unlink \*/

{

struct cache\_blk\_t \*prev, \*ent;

int index = CACHE\_HASH(cp, blk->tag);

/\* locate the block in the hash table bucket chain \*/

for (prev=NULL,ent=set->hash[index];

ent;

prev=ent,ent=ent->hash\_next)

{

if (ent == blk)

break;

}

assert(ent);

/\* unlink the block from the hash table bucket chain \*/

if (!prev)

{

/\* head of hash bucket list \*/

set->hash[index] = ent->hash\_next;

}

else

{

/\* middle or end of hash bucket list \*/

prev->hash\_next = ent->hash\_next;

}

ent->hash\_next = NULL;

}

/\* insert BLK onto the head of the hash table bucket chain in SET \*/

static void

link\_htab\_ent(struct cache\_t \*cp, /\* cache to update \*/

struct cache\_set\_t \*set, /\* set containing bkt chain \*/

struct cache\_blk\_t \*blk) /\* block to insert \*/

{

int index = CACHE\_HASH(cp, blk->tag);

/\* insert block onto the head of the bucket chain \*/

blk->hash\_next = set->hash[index];

set->hash[index] = blk;

}

/\* where to insert a block onto the ordered way chain \*/

enum list\_loc\_t { Head, Tail };

/\* insert BLK into the order way chain in SET at location WHERE \*/

static void

update\_way\_list(struct cache\_set\_t \*set, /\* set contained way chain \*/

struct cache\_blk\_t \*blk, /\* block to insert \*/

enum list\_loc\_t where) /\* insert location \*/

{

/\* unlink entry from the way list \*/

if (!blk->way\_prev && !blk->way\_next)

{

/\* only one entry in list (direct-mapped), no action \*/

assert(set->way\_head == blk && set->way\_tail == blk);

/\* Head/Tail order already \*/

return;

}

/\* else, more than one element in the list \*/

else if (!blk->way\_prev)

{

assert(set->way\_head == blk && set->way\_tail != blk);

if (where == Head)

{

/\* already there \*/

return;

}

/\* else, move to tail \*/

set->way\_head = blk->way\_next;

blk->way\_next->way\_prev = NULL;

}

else if (!blk->way\_next)

{

/\* end of list (and not front of list) \*/

assert(set->way\_head != blk && set->way\_tail == blk);

if (where == Tail)

{

/\* already there \*/

return;

}

set->way\_tail = blk->way\_prev;

blk->way\_prev->way\_next = NULL;

}

else

{

/\* middle of list (and not front or end of list) \*/

assert(set->way\_head != blk && set->way\_tail != blk);

blk->way\_prev->way\_next = blk->way\_next;

blk->way\_next->way\_prev = blk->way\_prev;

}

/\* link BLK back into the list \*/

if (where == Head)

{

/\* link to the head of the way list \*/

blk->way\_next = set->way\_head;

blk->way\_prev = NULL;

set->way\_head->way\_prev = blk;

set->way\_head = blk;

}

else if (where == Tail)

{

/\* link to the tail of the way list \*/

blk->way\_prev = set->way\_tail;

blk->way\_next = NULL;

set->way\_tail->way\_next = blk;

set->way\_tail = blk;

}

else

panic("bogus WHERE designator");

}

/\* create and initialize a general cache structure \*/

struct cache\_t \* /\* pointer to cache created \*/

cache\_create(char \*name, /\* name of the cache \*/

int nsets, /\* total number of sets in cache \*/

int bsize, /\* block (line) size of cache \*/

int balloc, /\* allocate data space for blocks? \*/

int usize, /\* size of user data to alloc w/blks \*/

int assoc, /\* associativity of cache \*/

enum cache\_policy policy, /\* replacement policy w/in sets \*/

/\* block access function, see description w/in struct cache def \*/

unsigned int (\*blk\_access\_fn)(enum mem\_cmd cmd,

md\_addr\_t baddr, int bsize,

struct cache\_blk\_t \*blk,

tick\_t now),

unsigned int hit\_latency) /\* latency in cycles for a hit \*/

{

struct cache\_t \*cp;

struct cache\_blk\_t \*blk;

int i, j, bindex;

/\* check all cache parameters \*/

if (nsets <= 0)

fatal("cache size (in sets) `%d' must be non-zero", nsets);

if ((nsets & (nsets-1)) != 0)

fatal("cache size (in sets) `%d' is not a power of two", nsets);

/\* blocks must be at least one datum large, i.e., 8 bytes for SS \*/

if (bsize < 8)

fatal("cache block size (in bytes) `%d' must be 8 or greater", bsize);

if ((bsize & (bsize-1)) != 0)

fatal("cache block size (in bytes) `%d' must be a power of two", bsize);

if (usize < 0)

fatal("user data size (in bytes) `%d' must be a positive value", usize);

if (assoc <= 0)

fatal("cache associativity `%d' must be non-zero and positive", assoc);

if ((assoc & (assoc-1)) != 0)

fatal("cache associativity `%d' must be a power of two", assoc);

if (!blk\_access\_fn)

fatal("must specify miss/replacement functions");

/\* allocate the cache structure \*/

cp = (struct cache\_t \*)

calloc(1, sizeof(struct cache\_t) + (nsets-1)\*sizeof(struct cache\_set\_t));

if (!cp)

fatal("out of virtual memory");

/\* initialize user parameters \*/

cp->name = mystrdup(name);

cp->nsets = nsets;

cp->bsize = bsize;

cp->balloc = balloc;

cp->usize = usize;

cp->assoc = assoc;

cp->policy = policy;

cp->hit\_latency = hit\_latency;

/\* miss/replacement functions \*/

cp->blk\_access\_fn = blk\_access\_fn;

/\* compute derived parameters \*/

cp->hsize = CACHE\_HIGHLY\_ASSOC(cp) ? (assoc >> 2) : 0;

cp->blk\_mask = bsize-1;

cp->set\_shift = log\_base2(bsize);

cp->set\_mask = nsets-1;

cp->tag\_shift = cp->set\_shift + log\_base2(nsets);

cp->tag\_mask = (1 << (32 - cp->tag\_shift))-1;

cp->tagset\_mask = ~cp->blk\_mask;

cp->bus\_free = 0;

/\* print derived parameters during debug \*/

debug("%s: cp->hsize = %d", cp->name, cp->hsize);

debug("%s: cp->blk\_mask = 0x%08x", cp->name, cp->blk\_mask);

debug("%s: cp->set\_shift = %d", cp->name, cp->set\_shift);

debug("%s: cp->set\_mask = 0x%08x", cp->name, cp->set\_mask);

debug("%s: cp->tag\_shift = %d", cp->name, cp->tag\_shift);

debug("%s: cp->tag\_mask = 0x%08x", cp->name, cp->tag\_mask);

/\* initialize cache stats \*/

cp->hits = 0;

cp->misses = 0;

cp->replacements = 0;

cp->writebacks = 0;

cp->invalidations = 0;

/\* blow away the last block accessed \*/

cp->last\_tagset = 0;

cp->last\_blk = NULL;

/\* allocate data blocks \*/

cp->data = (byte\_t \*)calloc(nsets \* assoc,

sizeof(struct cache\_blk\_t) +

(cp->balloc ? (bsize\*sizeof(byte\_t)) : 0));

if (!cp->data)

fatal("out of virtual memory");

/\* slice up the data blocks \*/

for (bindex=0,i=0; i<nsets; i++)

{

cp->sets[i].way\_head = NULL;

cp->sets[i].way\_tail = NULL;

/\* get a hash table, if needed \*/

if (cp->hsize)

{

cp->sets[i].hash =

(struct cache\_blk\_t \*\*)calloc(cp->hsize,

sizeof(struct cache\_blk\_t \*));

if (!cp->sets[i].hash)

fatal("out of virtual memory");

}

/\* NOTE: all the blocks in a set \*must\* be allocated contiguously,

otherwise, block accesses through SET->BLKS will fail (used

during random replacement selection) \*/

cp->sets[i].blks = CACHE\_BINDEX(cp, cp->data, bindex);

/\* link the data blocks into ordered way chain and hash table bucket

chains, if hash table exists \*/

for (j=0; j<assoc; j++)

{

/\* locate next cache block \*/

blk = CACHE\_BINDEX(cp, cp->data, bindex);

bindex++;

//\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*

//\*\* \*\*

//\*\* Set RRPV Here \*\*

//\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*

//blk->rrpv=3; //M=2

blk->rrpv=7; //M=3

//blk->rrpv=15; //M=4

// blk->rrpv=31;//M=5

/\* invalidate new cache block \*/

blk->status = 0;

blk->tag = 0;

blk->ready = 0;

blk->user\_data = (usize != 0

? (byte\_t \*)calloc(usize, sizeof(byte\_t)) : NULL);

/\* insert cache block into set hash table \*/

if (cp->hsize)

link\_htab\_ent(cp, &cp->sets[i], blk);

/\* insert into head of way list, order is arbitrary at this point \*/

blk->way\_next = cp->sets[i].way\_head;

blk->way\_prev = NULL;

if (cp->sets[i].way\_head)

cp->sets[i].way\_head->way\_prev = blk;

cp->sets[i].way\_head = blk;

if (!cp->sets[i].way\_tail)

cp->sets[i].way\_tail = blk;

}

}

return cp;

}

/\* parse policy \*/

enum cache\_policy /\* replacement policy enum \*/

cache\_char2policy(char c) /\* replacement policy as a char \*/

{

switch (c) {

case 'l': return LRU;

case 'r': return Random;

case 'f': return FIFO;

case 's': return SRRIP; //\*\*\*\*\*\*\*\*\* Choose THE SRRIP replacement policy

default: fatal("bogus replacement policy, `%c'", c);

}

}

/\* print cache configuration \*/

void

cache\_config(struct cache\_t \*cp, /\* cache instance \*/

FILE \*stream) /\* output stream \*/

{

fprintf(stream,

"cache: %s: %d sets, %d byte blocks, %d bytes user data/block\n",

cp->name, cp->nsets, cp->bsize, cp->usize);

fprintf(stream,

"cache: %s: %d-way, `%s' replacement policy, write-back\n",

cp->name, cp->assoc,

cp->policy == LRU ? "LRU"

: cp->policy == Random ? "Random"

: cp->policy == FIFO ? "FIFO"

:cp->policy==SRRIP? "SRRIP"

: (abort(), ""));

}

/\* register cache stats \*/

void

cache\_reg\_stats(struct cache\_t \*cp, /\* cache instance \*/

struct stat\_sdb\_t \*sdb) /\* stats database \*/

{

char buf[512], buf1[512], \*name;

/\* get a name for this cache \*/

if (!cp->name || !cp->name[0])

name = "<unknown>";

else

name = cp->name;

sprintf(buf, "%s.accesses", name);

sprintf(buf1, "%s.hits + %s.misses", name, name);

stat\_reg\_formula(sdb, buf, "total number of accesses", buf1, "%12.0f");

sprintf(buf, "%s.hits", name);

stat\_reg\_counter(sdb, buf, "total number of hits", &cp->hits, 0, NULL);

sprintf(buf, "%s.misses", name);

stat\_reg\_counter(sdb, buf, "total number of misses", &cp->misses, 0, NULL);

sprintf(buf, "%s.replacements", name);

stat\_reg\_counter(sdb, buf, "total number of replacements",

&cp->replacements, 0, NULL);

sprintf(buf, "%s.writebacks", name);

stat\_reg\_counter(sdb, buf, "total number of writebacks",

&cp->writebacks, 0, NULL);

sprintf(buf, "%s.invalidations", name);

stat\_reg\_counter(sdb, buf, "total number of invalidations",

&cp->invalidations, 0, NULL);

sprintf(buf, "%s.miss\_rate", name);

sprintf(buf1, "%s.misses / %s.accesses", name, name);

stat\_reg\_formula(sdb, buf, "miss rate (i.e., misses/ref)", buf1, NULL);

sprintf(buf, "%s.repl\_rate", name);

sprintf(buf1, "%s.replacements / %s.accesses", name, name);

stat\_reg\_formula(sdb, buf, "replacement rate (i.e., repls/ref)", buf1, NULL);

sprintf(buf, "%s.wb\_rate", name);

sprintf(buf1, "%s.writebacks / %s.accesses", name, name);

stat\_reg\_formula(sdb, buf, "writeback rate (i.e., wrbks/ref)", buf1, NULL);

sprintf(buf, "%s.inv\_rate", name);

sprintf(buf1, "%s.invalidations / %s.accesses", name, name);

stat\_reg\_formula(sdb, buf, "invalidation rate (i.e., invs/ref)", buf1, NULL);

}

/\* print cache stats \*/

void

cache\_stats(struct cache\_t \*cp, /\* cache instance \*/

FILE \*stream) /\* output stream \*/

{

double sum = (double)(cp->hits + cp->misses);

fprintf(stream,

"cache: %s: %.0f hits %.0f misses %.0f repls %.0f invalidations\n",

cp->name, (double)cp->hits, (double)cp->misses,

(double)cp->replacements, (double)cp->invalidations);

fprintf(stream,

"cache: %s: miss rate=%f repl rate=%f invalidation rate=%f\n",

cp->name,

(double)cp->misses/sum, (double)(double)cp->replacements/sum,

(double)cp->invalidations/sum);

}

/\* access a cache, perform a CMD operation on cache CP at address ADDR,

places NBYTES of data at \*P, returns latency of operation if initiated

at NOW, places pointer to block user data in \*UDATA, \*P is untouched if

cache blocks are not allocated (!CP->BALLOC), UDATA should be NULL if no

user data is attached to blocks \*/

unsigned int /\* latency of access in cycles \*/

cache\_access(struct cache\_t \*cp, /\* cache to access \*/

enum mem\_cmd cmd, /\* access type, Read or Write \*/

md\_addr\_t addr, /\* address of access \*/

void \*vp, /\* ptr to buffer for input/output \*/

int nbytes, /\* number of bytes to access \*/

tick\_t now, /\* time of access \*/

byte\_t \*\*udata, /\* for return of user data ptr \*/

md\_addr\_t \*repl\_addr) /\* for address of replaced block \*/

{

byte\_t \*p = vp;

md\_addr\_t tag = CACHE\_TAG(cp, addr);

md\_addr\_t set = CACHE\_SET(cp, addr);

md\_addr\_t bofs = CACHE\_BLK(cp, addr);

struct cache\_blk\_t \*blk, \*repl = NULL;

int lat = 0;

/\* default replacement address \*/

if (repl\_addr)

\*repl\_addr = 0;

/\* check alignments \*/

if ((nbytes & (nbytes-1)) != 0 || (addr & (nbytes-1)) != 0)

fatal("cache: access error: bad size or alignment, addr 0x%08x", addr);

/\* access must fit in cache block \*/

/\* FIXME:

((addr + (nbytes - 1)) > ((addr & ~cp->blk\_mask) + (cp->bsize - 1))) \*/

if ((addr + nbytes) > ((addr & ~cp->blk\_mask) + cp->bsize))

fatal("cache: access error: access spans block, addr 0x%08x", addr);

/\* permissions are checked on cache misses \*/

/\* check for a fast hit: access to same block \*/

if (CACHE\_TAGSET(cp, addr) == cp->last\_tagset)

{

/\* hit in the same block \*/

blk = cp->last\_blk;

goto cache\_fast\_hit;

}

if (cp->hsize)

{

/\* higly-associativity cache, access through the per-set hash tables \*/

int hindex = CACHE\_HASH(cp, tag);

for (blk=cp->sets[set].hash[hindex];

blk;

blk=blk->hash\_next)

{

if (blk->tag == tag && (blk->status & CACHE\_BLK\_VALID))

goto cache\_hit;

}

}

else

{

/\* low-associativity cache, linear search the way list \*/

for (blk=cp->sets[set].way\_head;

blk;

blk=blk->way\_next)

{

if (blk->tag == tag && (blk->status & CACHE\_BLK\_VALID))

goto cache\_hit;

}

}

/\* cache block not found \*/

/\* \*\*MISS\*\* \*/

cp->misses++;

/\* select the appropriate block to replace, and re-link this entry to

the appropriate place in the way list \*/

int flag=0;

switch (cp->policy) {

case LRU:

case FIFO:

repl = cp->sets[set].way\_tail;

update\_way\_list(&cp->sets[set], repl, Head);

break;

case Random:

{

int bindex = myrand() & (cp->assoc - 1);

repl = CACHE\_BINDEX(cp, cp->sets[set].blks, bindex);

}

break;

case SRRIP:

while (flag==0) {

for (blk=cp->sets[set].way\_head; blk; blk=blk->way\_next) {

if (blk->rrpv==7) {

repl=blk;

flag=1;

break;

}

}

if (flag==0) {

for (blk=cp->sets[set].way\_head; blk; blk=blk->way\_next){

if (blk->rrpv!=7) {

blk->rrpv++;

}

}

}

}

break;

default:

panic("bogus replacement policy");

}

/\* remove this block from the hash bucket chain, if hash exists \*/

if (cp->hsize)

unlink\_htab\_ent(cp, &cp->sets[set], repl);

/\* blow away the last block to hit \*/

cp->last\_tagset = 0;

cp->last\_blk = NULL;

/\* write back replaced block data \*/

if (repl->status & CACHE\_BLK\_VALID)

{

cp->replacements++;

if (repl\_addr)

\*repl\_addr = CACHE\_MK\_BADDR(cp, repl->tag, set);

/\* don't replace the block until outstanding misses are satisfied \*/

lat += BOUND\_POS(repl->ready - now);

/\* stall until the bus to next level of memory is available \*/

lat += BOUND\_POS(cp->bus\_free - (now + lat));

/\* track bus resource usage \*/

cp->bus\_free = MAX(cp->bus\_free, (now + lat)) + 1;

if (repl->status & CACHE\_BLK\_DIRTY)

{

/\* write back the cache block \*/

cp->writebacks++;

lat += cp->blk\_access\_fn(Write,

CACHE\_MK\_BADDR(cp, repl->tag, set),

cp->bsize, repl, now+lat);

}

}

/\* update block tags \*/

repl->tag = tag;

repl->status = CACHE\_BLK\_VALID; /\* dirty bit set on update \*/

//\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*

// Update the RRPV value of replaced block \*\*

//\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*

repl->rrpv=6; // 0 means near-immediate/ 2^M means distant RRIP/ 2^M-2 means long RRIP

//\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*

/\* read data block \*/

lat += cp->blk\_access\_fn(Read, CACHE\_BADDR(cp, addr), cp->bsize,

repl, now+lat);

/\* copy data out of cache block \*/

if (cp->balloc)

{

CACHE\_BCOPY(cmd, repl, bofs, p, nbytes);

}

/\* update dirty status \*/

if (cmd == Write)

repl->status |= CACHE\_BLK\_DIRTY;

/\* get user block data, if requested and it exists \*/

if (udata)

\*udata = repl->user\_data;

/\* update block status \*/

repl->ready = now+lat;

/\* link this entry back into the hash table \*/

if (cp->hsize)

link\_htab\_ent(cp, &cp->sets[set], repl);

/\* return latency of the operation \*/

return lat;

cache\_hit: /\* slow hit handler \*/

/\* \*\*HIT\*\* \*/

cp->hits++;

/\* copy data out of cache block, if block exists \*/

if (cp->balloc)

{

CACHE\_BCOPY(cmd, blk, bofs, p, nbytes);

}

/\* update dirty status \*/

if (cmd == Write)

blk->status |= CACHE\_BLK\_DIRTY;

/\* if LRU replacement and this is not the first element of list, reorder \*/

if (blk->way\_prev && cp->policy == LRU)

{

/\* move this block to head of the way (MRU) list \*/

update\_way\_list(&cp->sets[set], blk, Head);

}

//\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*

// Update the RRPV if replacement policy is SRRIP \*\*

//\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*

if (blk->rrpv!=0 && cp->policy==SRRIP) {

/\* set RRPV of block to 0 \*/

//blk->rrpv=0; //HP

blk->rrpv=blk->rrpv-1; //FP

}

//\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*

/\* tag is unchanged, so hash links (if they exist) are still valid \*/

/\* record the last block to hit \*/

cp->last\_tagset = CACHE\_TAGSET(cp, addr);

cp->last\_blk = blk;

/\* get user block data, if requested and it exists \*/

if (udata)

\*udata = blk->user\_data;

/\* return first cycle data is available to access \*/

return (int) MAX(cp->hit\_latency, (blk->ready - now));

cache\_fast\_hit: /\* fast hit handler \*/

/\* \*\*FAST HIT\*\* \*/

cp->hits++;

/\* copy data out of cache block, if block exists \*/

if (cp->balloc)

{

CACHE\_BCOPY(cmd, blk, bofs, p, nbytes);

}

/\* update dirty status \*/

if (cmd == Write)

blk->status |= CACHE\_BLK\_DIRTY;

/\* this block hit last, no change in the way list \*/

/\* this block hit last, no change in the RRPV value \*/

if (blk->rrpv!=0 && cp->policy==SRRIP) {

/\* set RRPV of block to 0 \*/

//blk->rrpv=0; //HP

blk->rrpv=blk->rrpv-1; //FP

}

/\* tag is unchanged, so hash links (if they exist) are still valid \*/

/\* get user block data, if requested and it exists \*/

if (udata)

\*udata = blk->user\_data;

/\* record the last block to hit \*/

cp->last\_tagset = CACHE\_TAGSET(cp, addr);

cp->last\_blk = blk;

/\* return first cycle data is available to access \*/

return (int) MAX(cp->hit\_latency, (blk->ready - now));

}

**Appendix B: Simulation Sample**

*This is just a sample of simulation results.*

sim-outorder: SimpleScalar/Alpha Tool Set version 3.0 of August, 2003.

Copyright (c) 1994-2003 by Todd M. Austin, Ph.D. and SimpleScalar, LLC.

All Rights Reserved. This version of SimpleScalar is licensed for academic

non-commercial use. No portion of this work may be used by any commercial

entity, or for any commercial purpose, without the prior written permission

of SimpleScalar, LLC (info@simplescalar.com).

warning: section `.comment' ignored...

sim: command line: ../../simplesim-3.0/sim-outorder -max:inst 50000000 -fastfwd 20000000 -redir:sim 2W\_s.txt -bpred 2lev -bpred:2lev 1 256 4 0 -bpred:ras 8 -bpred:btb 64 2 -cache:dl1 dl1:128:64:4:l -cache:dl2 dl2:8192:64:2:s ../../spec2000binaries/art00.peak.ev6 -scanfile c756hel.in -trainfile1 a10.img -trainfile2 hc.img -stride 2 -startx 110 -starty 200 -endx 160 -endy 240 -objects 10

sim: simulation started @ Sun Dec 1 16:51:35 2013, options follow:

sim-outorder: This simulator implements a very detailed out-of-order issue

superscalar processor with a two-level memory system and speculative

execution support. This simulator is a performance simulator, tracking the

latency of all pipeline operations.

# -config # load configuration from a file

# -dumpconfig # dump configuration to a file

# -h false # print help message

# -v false # verbose operation

# -d false # enable debug message

# -i false # start in Dlite debugger

-seed 1 # random number generator seed (0 for timer seed)

# -q false # initialize and terminate immediately

# -chkpt <null> # restore EIO trace execution from <fname>

# -redir:sim 2W\_s.txt # redirect simulator output to file (non-interactive only)

# -redir:prog <null> # redirect simulated program output to file

-nice 0 # simulator scheduling priority

-max:inst 50000000 # maximum number of inst's to execute

-fastfwd 20000000 # number of insts skipped before timing starts

# -ptrace <null> # generate pipetrace, i.e., <fname|stdout|stderr> <range>

-fetch:ifqsize 4 # instruction fetch queue size (in insts)

-fetch:mplat 3 # extra branch mis-prediction latency

-fetch:speed 1 # speed of front-end of machine relative to execution core

-bpred 2lev # branch predictor type {nottaken|taken|perfect|bimod|2lev|comb|tournament}

-bpred:bimod 2048 # bimodal predictor config (<table size>)

-bpred:2lev 1 256 4 0 # 2-level predictor config (<l1size> <l2size> <hist\_size> <xor>)

-bpred:comb 1024 # combining predictor config (<meta\_table\_size>)

-bpred:tournament 4906 12 1024 1024 # tournament predictor config (<sel\_size> <global\_regsize> <local\_htb\_size> <local\_hrsize>)

-bpred:ras 8 # return address stack size (0 for no return stack)

-bpred:btb 64 2 # BTB config (<num\_sets> <associativity>)

# -bpred:spec\_update <null> # speculative predictors update in {ID|WB} (default non-spec)

-decode:width 4 # instruction decode B/W (insts/cycle)

-issue:width 4 # instruction issue B/W (insts/cycle)

-issue:inorder false # run pipeline with in-order issue

-issue:wrongpath true # issue instructions down wrong execution paths

-commit:width 4 # instruction commit B/W (insts/cycle)

-ruu:size 16 # register update unit (RUU) size

-lsq:size 8 # load/store queue (LSQ) size

-cache:dl1 dl1:128:64:4:l # l1 data cache config, i.e., {<config>|none}

-cache:dl1lat 1 # l1 data cache hit latency (in cycles)

-cache:dl2 dl2:8192:64:2:s # l2 data cache config, i.e., {<config>|none}

-cache:dl2lat 6 # l2 data cache hit latency (in cycles)

-cache:il1 il1:512:32:1:l # l1 inst cache config, i.e., {<config>|dl1|dl2|none}

-cache:il1lat 1 # l1 instruction cache hit latency (in cycles)

-cache:il2 dl2 # l2 instruction cache config, i.e., {<config>|dl2|none}

-cache:il2lat 6 # l2 instruction cache hit latency (in cycles)

-cache:flush false # flush caches on system calls

-cache:icompress false # convert 64-bit inst addresses to 32-bit inst equivalents

-mem:lat 18 2 # memory access latency (<first\_chunk> <inter\_chunk>)

-mem:width 8 # memory access bus width (in bytes)

-tlb:itlb itlb:16:4096:4:l # instruction TLB config, i.e., {<config>|none}

-tlb:dtlb dtlb:32:4096:4:l # data TLB config, i.e., {<config>|none}

-tlb:lat 30 # inst/data TLB miss latency (in cycles)

-res:ialu 4 # total number of integer ALU's available

-res:imult 1 # total number of integer multiplier/dividers available

-res:memport 2 # total number of memory system ports available (to CPU)

-res:fpalu 4 # total number of floating point ALU's available

-res:fpmult 1 # total number of floating point multiplier/dividers available

# -pcstat <null> # profile stat(s) against text addr's (mult uses ok)

-bugcompat false # operate in backward-compatible bugs mode (for testing only)

Pipetrace range arguments are formatted as follows:

{{@|#}<start>}:{{@|#|+}<end>}

Both ends of the range are optional, if neither are specified, the entire

execution is traced. Ranges that start with a `@' designate an address

range to be traced, those that start with an `#' designate a cycle count

range. All other range values represent an instruction count range. The

second argument, if specified with a `+', indicates a value relative

to the first argument, e.g., 1000:+100 == 1000:1100. Program symbols may

be used in all contexts.

Examples: -ptrace FOO.trc #0:#1000

-ptrace BAR.trc @2000:

-ptrace BLAH.trc :1500

-ptrace UXXE.trc :

-ptrace FOOBAR.trc @main:+278

Branch predictor configuration examples for 2-level predictor:

Configurations: N, M, W, X

N # entries in first level (# of shift register(s))

W width of shift register(s)

M # entries in 2nd level (# of counters, or other FSM)

X (yes-1/no-0) xor history and address for 2nd level index

Sample predictors:

GAg : 1, W, 2^W, 0

GAp : 1, W, M (M > 2^W), 0

PAg : N, W, 2^W, 0

PAp : N, W, M (M == 2^(N+W)), 0

gshare : 1, W, 2^W, 1

Predictor `comb' combines a bimodal and a 2-level predictor.

The cache config parameter <config> has the following format:

<name>:<nsets>:<bsize>:<assoc>:<repl>

<name> - name of the cache being defined

<nsets> - number of sets in the cache

<bsize> - block size of the cache

<assoc> - associativity of the cache

<repl> - block replacement strategy, 'l'-LRU, 'f'-FIFO, 'r'-random, 's'-SSRIP

Examples: -cache:dl1 dl1:4096:32:1:l

-dtlb dtlb:128:4096:32:r

Cache levels can be unified by pointing a level of the instruction cache

hierarchy at the data cache hiearchy using the "dl1" and "dl2" cache

configuration arguments. Most sensible combinations are supported, e.g.,

A unified l2 cache (il2 is pointed at dl2):

-cache:il1 il1:128:64:1:l -cache:il2 dl2

-cache:dl1 dl1:256:32:1:l -cache:dl2 ul2:1024:64:2:l

Or, a fully unified cache hierarchy (il1 pointed at dl1):

-cache:il1 dl1

-cache:dl1 ul1:256:32:1:l -cache:dl2 ul2:1024:64:2:l

sim: \*\* fast forwarding 20000000 insts \*\*

sim: \*\* starting performance simulation \*\*

sim: \*\* simulation statistics \*\*

sim\_num\_insn 50000000 # total number of instructions committed

sim\_num\_refs 18898516 # total number of loads and stores committed

sim\_num\_loads 15101474 # total number of loads committed

sim\_num\_stores 3797042.0000 # total number of stores committed

sim\_num\_branches 5652369 # total number of branches committed

sim\_elapsed\_time 70 # total simulation time in seconds

sim\_inst\_rate 714285.7143 # simulation speed (in insts/sec)

sim\_total\_insn 52060555 # total number of instructions executed

sim\_total\_refs 18909716 # total number of loads and stores executed

sim\_total\_loads 15111343 # total number of loads executed

sim\_total\_stores 3798373.0000 # total number of stores executed

sim\_total\_branches 6400085 # total number of branches executed

sim\_cycle 76081542 # total simulation time in cycles

sim\_IPC 0.6572 # instructions per cycle

sim\_CPI 1.5216 # cycles per instruction

sim\_exec\_BW 0.6843 # total instructions (mis-spec + committed) per cycle

sim\_IPB 8.8458 # instruction per branch

IFQ\_count 294367876 # cumulative IFQ occupancy

IFQ\_fcount 72000579 # cumulative IFQ full count

ifq\_occupancy 3.8691 # avg IFQ occupancy (insn's)

ifq\_rate 0.6843 # avg IFQ dispatch rate (insn/cycle)

ifq\_latency 5.6543 # avg IFQ occupant latency (cycle's)

ifq\_full 0.9464 # fraction of time (cycle's) IFQ was full

RUU\_count 998779473 # cumulative RUU occupancy

RUU\_fcount 21943415 # cumulative RUU full count

ruu\_occupancy 13.1278 # avg RUU occupancy (insn's)

ruu\_rate 0.6843 # avg RUU dispatch rate (insn/cycle)

ruu\_latency 19.1850 # avg RUU occupant latency (cycle's)

ruu\_full 0.2884 # fraction of time (cycle's) RUU was full

LSQ\_count 475655555 # cumulative LSQ occupancy

LSQ\_fcount 48851906 # cumulative LSQ full count

lsq\_occupancy 6.2519 # avg LSQ occupancy (insn's)

lsq\_rate 0.6843 # avg LSQ dispatch rate (insn/cycle)

lsq\_latency 9.1366 # avg LSQ occupant latency (cycle's)

lsq\_full 0.6421 # fraction of time (cycle's) LSQ was full

sim\_slip 1538829233 # total number of slip cycles

avg\_sim\_slip 30.7766 # the average slip between issue and retirement

bpred\_2lev.lookups 6686562 # total number of bpred lookups

bpred\_2lev.updates 5652364 # total number of updates

bpred\_2lev.addr\_hits 5365280 # total number of address-predicted hits

bpred\_2lev.dir\_hits 5365314 # total number of direction-predicted hits (includes addr-hits)

bpred\_2lev.misses 287050 # total number of misses

bpred\_2lev.jr\_hits 40007 # total number of address-predicted hits for JR's

bpred\_2lev.jr\_seen 40008 # total number of JR's seen

bpred\_2lev.jr\_non\_ras\_hits.PP 0 # total number of address-predicted hits for non-RAS JR's

bpred\_2lev.jr\_non\_ras\_seen.PP 0 # total number of non-RAS JR's seen

bpred\_2lev.bpred\_addr\_rate 0.9492 # branch address-prediction rate (i.e., addr-hits/updates)

bpred\_2lev.bpred\_dir\_rate 0.9492 # branch direction-prediction rate (i.e., all-hits/updates)

bpred\_2lev.bpred\_jr\_rate 1.0000 # JR address-prediction rate (i.e., JR addr-hits/JRs seen)

bpred\_2lev.bpred\_jr\_non\_ras\_rate.PP <error: divide by zero> # non-RAS JR addr-pred rate (ie, non-RAS JR hits/JRs seen)

bpred\_2lev.retstack\_pushes 42175 # total number of address pushed onto ret-addr stack

bpred\_2lev.retstack\_pops 40024 # total number of address popped off of ret-addr stack

bpred\_2lev.used\_ras.PP 40008 # total number of RAS predictions used

bpred\_2lev.ras\_hits.PP 40007 # total number of RAS hits

bpred\_2lev.ras\_rate.PP 1.0000 # RAS prediction rate (i.e., RAS hits/used RAS)

il1.accesses 53208276 # total number of accesses

il1.hits 53208031 # total number of hits

il1.misses 245 # total number of misses

il1.replacements 4 # total number of replacements

il1.writebacks 0 # total number of writebacks

il1.invalidations 0 # total number of invalidations

il1.miss\_rate 0.0000 # miss rate (i.e., misses/ref)

il1.repl\_rate 0.0000 # replacement rate (i.e., repls/ref)

il1.wb\_rate 0.0000 # writeback rate (i.e., wrbks/ref)

il1.inv\_rate 0.0000 # invalidation rate (i.e., invs/ref)

dl1.accesses 18643249 # total number of accesses

dl1.hits 11979587 # total number of hits

dl1.misses 6663662 # total number of misses

dl1.replacements 6663150 # total number of replacements

dl1.writebacks 1334027 # total number of writebacks

dl1.invalidations 0 # total number of invalidations

dl1.miss\_rate 0.3574 # miss rate (i.e., misses/ref)

dl1.repl\_rate 0.3574 # replacement rate (i.e., repls/ref)

dl1.wb\_rate 0.0716 # writeback rate (i.e., wrbks/ref)

dl1.inv\_rate 0.0000 # invalidation rate (i.e., invs/ref)

dl2.accesses 7997934 # total number of accesses

dl2.hits 4506649 # total number of hits

dl2.misses 3491285 # total number of misses

dl2.replacements 3474901 # total number of replacements

dl2.writebacks 277408 # total number of writebacks

dl2.invalidations 0 # total number of invalidations

dl2.miss\_rate 0.4365 # miss rate (i.e., misses/ref)

dl2.repl\_rate 0.4345 # replacement rate (i.e., repls/ref)

dl2.wb\_rate 0.0347 # writeback rate (i.e., wrbks/ref)

dl2.inv\_rate 0.0000 # invalidation rate (i.e., invs/ref)

itlb.accesses 53208276 # total number of accesses

itlb.hits 53208270 # total number of hits

itlb.misses 6 # total number of misses

itlb.replacements 0 # total number of replacements

itlb.writebacks 0 # total number of writebacks

itlb.invalidations 0 # total number of invalidations

itlb.miss\_rate 0.0000 # miss rate (i.e., misses/ref)

itlb.repl\_rate 0.0000 # replacement rate (i.e., repls/ref)

itlb.wb\_rate 0.0000 # writeback rate (i.e., wrbks/ref)

itlb.inv\_rate 0.0000 # invalidation rate (i.e., invs/ref)

dtlb.accesses 18907132 # total number of accesses

dtlb.hits 18723506 # total number of hits

dtlb.misses 183626 # total number of misses

dtlb.replacements 183498 # total number of replacements

dtlb.writebacks 0 # total number of writebacks

dtlb.invalidations 0 # total number of invalidations

dtlb.miss\_rate 0.0097 # miss rate (i.e., misses/ref)

dtlb.repl\_rate 0.0097 # replacement rate (i.e., repls/ref)

dtlb.wb\_rate 0.0000 # writeback rate (i.e., wrbks/ref)

dtlb.inv\_rate 0.0000 # invalidation rate (i.e., invs/ref)

sim\_invalid\_addrs 0 # total non-speculative bogus addresses seen (debug var)

ld\_text\_base 0x0120000000 # program text (code) segment base

ld\_text\_size 237568 # program text (code) size in bytes

ld\_data\_base 0x0140000000 # program initialized data segment base

ld\_data\_size 76672 # program init'ed `.data' and uninit'ed `.bss' size in bytes

ld\_stack\_base 0x011ff9b000 # program stack segment base (highest address in stack)

ld\_stack\_size 16384 # program initial stack size

ld\_prog\_entry 0x012000b410 # program entry point (initial PC)

ld\_environ\_base 0x011ff97000 # program environment base address address

ld\_target\_big\_endian 0 # target executable endian-ness, non-zero if big endian

mem.page\_count 496 # total number of pages allocated

mem.page\_mem 3968k # total size of memory pages allocated

mem.ptab\_misses 261432 # total first level page table misses

mem.ptab\_accesses 529442078 # total page table accesses

mem.ptab\_miss\_rate 0.0005 # first level page table miss rate